Web Documents Categorization using Fuzzy Representation and HAC
Authors
Abstract
Most existing techniques for characterizing Web documents are based on term-frequency analysis. In such models, given a set of documents, each document is represented by a feature vector in a vector space. However, because Web documents written in HTML are semi-structured by means of tags, traditional techniques that assign term weights only by frequency of occurrence may not represent the contents of such documents satisfactorily. Some recent studies have shown that the fuzzy representation (FR) of WWW information, based on the significance of HTML tags, is an effective alternative for characterizing Web documents. In this paper, the FR is used to generate the feature vector for each Web document, and the Hierarchical Agglomerative Clustering (HAC) algorithm is applied, in order to investigate the efficiency and effectiveness of automatically categorizing Web documents with similar contents. The experiments conducted suggest several benefits of using such an approach.
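The pipeline described in the abstract can be illustrated with a minimal sketch. The tag weights, the max-normalization used as a membership degree, the cosine similarity, and the average-linkage merge rule below are all assumptions made for illustration; the abstract does not give the paper's actual significance values or clustering settings, and every name in the code is hypothetical.

```python
import math
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical tag significance values (not taken from the paper).
TAG_WEIGHTS = {"title": 1.0, "h1": 0.8, "h2": 0.7, "b": 0.5, "a": 0.4}
DEFAULT_WEIGHT = 0.2  # assumed weight for text under any other tag


class TagWeightedTerms(HTMLParser):
    """Accumulate a tag-weighted score for every term in one document."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # currently open tags
        self.scores = defaultdict(float)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed tags).
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # A term inherits the most significant enclosing tag's weight.
        weight = max((TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack),
                     default=DEFAULT_WEIGHT)
        for term in re.findall(r"[a-z]+", data.lower()):
            self.scores[term] += weight


def fuzzy_vector(html):
    """Map each term to a membership degree in (0, 1] (assumed: score / max score)."""
    parser = TagWeightedTerms()
    parser.feed(html)
    parser.close()
    top = max(parser.scores.values(), default=1.0)
    return {term: score / top for term, score in parser.scores.items()}


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def hac(vectors, num_clusters):
    """Average-linkage HAC: repeatedly merge the two most similar clusters."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > num_clusters:
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        a, b = max(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters.pop(b)         # b > a, so index a stays valid
        clusters[a].extend(merged)
    return clusters
```

A call such as `hac([fuzzy_vector(page) for page in pages], num_clusters=2)` returns lists of document indices, one list per cluster. Stopping at a fixed number of clusters is only one possible choice; a similarity threshold would work as well.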
Similar resources
A Visualization Approach to Automatic Text Documents Categorization Based on HAC
The ability to visualize documents in clusters is essential. Even the best data summarization technique can be totally misleading if its results are poorly represented or visualized. As proposed in many studies, clustering techniques are applied and the results are produced when documents are grouped in clusters. However, in some cases, a user may want to know t...
A General Fuzzy-Based Framework for Text Representation and Its Application to Text Categorization
In this paper we develop a general framework for text representation based on fuzzy set theory. This work extends our original ideas [5],[4], in which a document is represented by a set of fuzzy concepts. The importance degree of these fuzzy concepts characterizes the semantics of documents and can be calculated by a specified aggregation function of
Generating and Applying Rules for Web Documents Retrieval
Web documents retrieval is very challenging because of the huge number of documents available and the difficulty of interpreting them. Both the effectiveness and the efficiency of retrieval are important. This paper presents some approaches from soft computing to improve the effectiveness of web documents retrieval. These approaches give a more accurate and reasonable representation of terms provided by the...
Conceptual matching in web search using FIS-CRM for representing documents
In this paper a new approach for achieving the conceptual matching between user queries and web documents is presented. The key of the proposed system is to use FIS-CRM (Fuzzy Interrelations and Synonymy Concept Representation Model) to represent the indexed web pages. This model (also implemented in the FISS metasearcher) is supported by a fuzzy synonymy dictionary and various thematic fuzzy o...
A Novel Approach for Web Document Classification
The web is a huge repository of information, and web documents need to be categorized to facilitate their search and retrieval. Web document classification plays an important role in information organization and retrieval. This paper presents a fuzzy set based approach for automatically classifying web documents into one of the classes represented by a set of training documen...
Journal title:
Volume, issue:
Pages: -
Publication date: 2000